Abstract

We describe the underlying probabilistic interpretation of alpha and beta divergences. We first show that beta 
divergences are inherently tied to Tweedie distributions, a particular type of exponential family, known as exponential 
dispersion models. Starting from the variance function of a Tweedie model, we outline how to get alpha and beta 
divergences as special cases of Csiszár's $f$-divergences and Bregman divergences. This result directly generalizes the well-known
relationship between the Gaussian distribution and least squares estimation to Tweedie models and beta divergence 
minimization. 

Index Terms 

Tweedie distributions, variance functions, alpha/beta divergences, deviance. 

I. Introduction 

In fitting a model to data, the error between the model prediction and the observed data can be quantified by a
divergence function. The sum-of-squares (Euclidean) cost is an example of such a divergence. It is well known that
minimizing the sum-of-squares error is equivalent to assuming a Gaussian distributed error term and leads to a
least squares algorithm. In recent years, researchers have started using alternative divergences in applications,
such as the KL (Kullback-Leibler) [1] or Itakura-Saito (IS) [2] divergences. It turns out that these divergences are
special cases of a more general family of divergences known as $\beta$-divergences [3]. A different but related family is
the $\alpha$-divergences. Iterative divergence minimization algorithms exist for both families [3]; however, it is often not
clear which divergence should be used in an application, nor what the equivalent noise distribution
is. In this context, our goal is to survey and investigate results about the relationship between $\alpha$- and $\beta$-divergences
and their statistical interpretation as a noise model. We believe that it is valuable to have a framework where
different divergence functions can be handled without having to invent optimization algorithms from scratch, an
aspect of central importance in practical work. We finish the paper by illustrating how the best divergence function
can be chosen by maximum likelihood. Moreover, a deeper understanding of the statistical interpretation of
divergence functions could further facilitate model assessment, comparison and improvement.

The motivation of this technical report is i) to present the central role of variance functions in unifying $\alpha$-
and $\beta$-divergences and Tweedie distributions, and ii) to provide a compact and simple derivation that unifies many results
scattered in the statistics and information theory literature about $\alpha$- and $\beta$-divergences related to Tweedie models.
The main observations and contributions of this report are:

1) We show that the dual cumulant function of Tweedie distributions generates $\alpha$- and $\beta$-divergences.

2) We simplify and unify the connection between $\alpha$- and $\beta$-divergences, the related scale-invariance properties, the
fact that the KL divergence is the unique divergence that is both an $\alpha$- and a $\beta$-divergence, and the conditions for
symmetric $\alpha$-divergences.

3) The $\beta$-divergence is shown to be equivalent to the statistical unit deviance, the scaled log-likelihood ratio of a
full model to a parametric model.

4) The density of dispersion models is reformulated using $\beta$-divergences.

Probability models and divergences are inherently related concepts, as shown by various studies. Banerjee et al. prove
the bijection between Bregman divergences and exponential family distributions [4]. Cichocki et al. mention the
connection between Tweedie distributions and $\beta$-divergences in their seminal book [3], but only briefly, in a single
paragraph. Our paper carries their observation one step further by establishing the mathematical formalization
based on the concept of variance functions [5]. A variance function defines the relationship between the
mean and variance of a distribution. For example, the special choice of no functional relationship between the
mean and variance (as in linear regression) implies Gaussianity. We show that a power relationship is sufficient to
derive $\beta$-divergences from Bregman divergences and $\alpha$-divergences from $f$-divergences. This result shows that
using a $\beta$-divergence in a model is equivalent to assuming a Tweedie density. For $\alpha$-divergences, such a
direct interpretation is less transparent, but we illustrate a very direct connection to $\beta$-divergences and the implicit
invariance assumptions about the data when $\alpha$-divergences are used.



To make the notation simpler, we drop the sum from the equations and simply work with scalar divergences $d$. In
particular, $d_f(x, \mu)$ denotes the $f$-divergence between $x$ and $\mu$ generated by the convex function $f$. Similarly, $d_\phi(\cdot, \cdot)$
is the Bregman divergence generated by the convex function $\phi$, whereas $d_\alpha(\cdot, \cdot)$ and $d_\beta(\cdot, \cdot)$, or simply $\alpha/\beta$, denote
the alpha ($\alpha$) and beta ($\beta$) divergences as special cases. Provided that the type of divergence (alpha or beta) is clear from
the context, we may replace the alpha and beta symbols with the index parameter $p$, as in $d_p(\cdot, \cdot)$. The log-likelihood is
denoted by $\mathcal{L}_x(\mu)$. In this paper, we assume the univariate case and consider only scalar-valued functions, although
the work can easily be extended to the multivariate case.



II. Background 



In this paper, we only consider separable divergences

$$ D(\mathbf{x}, \boldsymbol{\mu}) = \sum_{i=1}^{n} d(x_i, \mu_i). \tag{1} $$



A. Exponential Dispersion Models and Tweedie Distributions 

Exponential Dispersion Models (EDM) are a linear exponential family defined as [6]

$$ p(x \mid \theta, \varphi) = h(x, \varphi) \exp\{\varphi^{-1}(\theta x - \psi(\theta))\}, \tag{2} $$

where $\theta$ is the canonical (natural) parameter, $\varphi$ is the dispersion parameter and $\psi$ is the cumulant generating
function ensuring normalization. Here, $h(x, \varphi)$ is the base measure and is independent of the canonical parameter.
The mean parameter (also called the expectation parameter) is denoted by $\mu$ and is tied to the canonical parameter $\theta$
by the differential equations

$$ \mu = \mu(\theta) = \frac{d\psi(\theta)}{d\theta}, \qquad \theta = \theta(\mu) = \frac{d\phi(\mu)}{d\mu}, \tag{3} $$

where $\phi(\mu)$ is the conjugate dual of $\psi(\theta)$, just as the canonical parameter $\theta$ is the conjugate dual of the expectation parameter
$\mu$. The relationship between $\theta$ and $\mu$ is more direct and is given as [6]

$$ \frac{d\theta}{d\mu} = v(\mu)^{-1}. \tag{4} $$

Here $v(\mu)$ is the variance function [6]-[8], and is related to the variance of the distribution by the dispersion
parameter

$$ \mathrm{Var}(x) = \varphi\, v(\mu). \tag{5} $$

As a special case of EDMs, Tweedie distributions $Tw_p(\mu, \varphi)$ specify the variance function as

$$ v(\mu) = \mu^p, \tag{6} $$

which fully characterizes the dispersion model. The variance function is the $p$'th power of the mean; therefore
it is called a power variance function (PVF) [6], [8]. Here, the special choices $p = 0, 1, 2, 3$ lead to the well-known
Gaussian, Poisson, gamma and inverse Gaussian distributions. For $1 < p < 2$, the distributions can be represented as a
Poisson sum of gamma distributions, the so-called compound Poisson distribution. Indeed, a distribution exists for all
real values of $p$ except for $0 < p < 1$ [6]. The history of Tweedie distributions goes back to Tweedie's unnoticed work
in 1947 [9]. In 1972, Nelder and Wedderburn published a seminal paper on generalized linear models (GLMs)
[10], in which the error distribution formulation was essentially identical to Tweedie's formulation, however without
any reference to Tweedie's work. In 1982, Morris used the term natural exponential families (NEF) [11], and in 1987
Jørgensen [6] coined the name Tweedie distribution.

B. Bregman Divergences and Csiszár $f$-Divergences

As detailed in the introduction, in many applications it is more convenient to think in terms of minimizing the
divergence between data and model prediction. Yet, probability models and divergences are inherently related
concepts [4]. Two general families of divergences are Bregman divergences and Csiszár $f$-divergences. Bregman
divergences were introduced by Bregman in 1967. By definition, for any real-valued differentiable convex function
$\phi$, the Bregman divergence is given by [4]

$$ d_\phi(x, \mu) = \phi(x) - \phi(\mu) - (x - \mu)\phi'(\mu). \tag{7} $$

It is equal to the tail of the first-order Taylor expansion of $\phi(x)$ at $\mu$. A major class of cost functions can be generated
by the Bregman divergence with appropriate functions $\phi$ as [4]

$$ d_\phi(x, \mu) = \begin{cases}
\frac{1}{2}(x - \mu)^2 & \text{EU}, \quad \phi(x) = \frac{1}{2}x^2 \\
x \log\frac{x}{\mu} - x + \mu & \text{KL}, \quad \phi(x) = x \log x \\
\frac{x}{\mu} - \log\frac{x}{\mu} - 1 & \text{IS}, \quad \phi(x) = -\log x
\end{cases} $$

These functions may look arbitrary at first sight, but in the next section we will show that they follow directly
from the power variance function assumption.

The $f$-divergences were introduced independently by Csiszár [12], Morimoto, and Ali & Silvey during the 1960s. They
generalize the Kullback-Leibler (KL) divergence, which dates back to 1951. By definition, for any real-valued convex function
$f$ satisfying $f(1) = 0$, the $f$-divergence is given by [12]

$$ d_f(x, \mu) = \mu f\!\left(\frac{x}{\mu}\right). \tag{8} $$

The Bregman and $f$-divergences are non-negative quantities, $d_f(x, \mu) \ge 0$, and are zero iff $x = \mu$, i.e. $d_f(x, x) = 0$.
Note that divergences are not distances, since in general they satisfy neither symmetry nor the triangle inequality.
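The two definitions (7) and (8) can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names `bregman` and `f_divergence` and the `GENERATORS` table are hypothetical, and the derivatives of $\phi$ are supplied by hand.

```python
import math

def bregman(phi, dphi, x, mu):
    """Bregman divergence (7): phi(x) - phi(mu) - (x - mu) * phi'(mu)."""
    return phi(x) - phi(mu) - (x - mu) * dphi(mu)

def f_divergence(f, x, mu):
    """Csiszar f-divergence (8): mu * f(x / mu), for convex f with f(1) = 0."""
    return mu * f(x / mu)

# Generators phi and their derivatives for the three cost functions in the text.
GENERATORS = {
    "EU": (lambda t: 0.5 * t * t, lambda t: t),
    "KL": (lambda t: t * math.log(t), lambda t: math.log(t) + 1.0),
    "IS": (lambda t: -math.log(t), lambda t: -1.0 / t),
}

x, mu = 2.0, 3.0
eu = bregman(*GENERATORS["EU"], x, mu)   # equals 0.5 * (x - mu)**2
kl = bregman(*GENERATORS["KL"], x, mu)   # equals x*log(x/mu) - x + mu
is_ = bregman(*GENERATORS["IS"], x, mu)  # equals x/mu - log(x/mu) - 1
```

Passing $f(t) = t \log t - t + 1$ to `f_divergence` gives the same KL value, which anticipates the later observation that KL belongs to both families.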

III. Tweedie Distributions and Alpha/Beta Divergences

In this section, we derive the link between Tweedie distributions and $\alpha/\beta$-divergences. We show that
the power variance function assumption is enough to derive both divergences; i.e., if we minimize the $\beta$-divergence,
we are assuming a noise density with a power variance function, and if we minimize the $\alpha$-divergence, we assume
a certain invariance.

A. Derivation of Conjugate (Dual) of Cumulant Function 

Starting from the power variance assumption, we first obtain the canonical parameter $\theta$ by solving the differential
equation $\frac{d\theta}{d\mu} = \mu^{-p}$ [8]

$$ \int d\theta = \int \mu^{-p}\, d\mu \;\Rightarrow\; \theta = \theta(\mu) = \frac{\mu^{1-p}}{1-p} + m, \tag{9} $$

where $m$ is the integration constant. Then we find the dual cumulant function $\phi(\cdot)$ by integrating (3) and using $\theta(\mu)$ in
(9)

$$ \phi(\mu) = \int \theta(\mu)\, d\mu = \int \left( \frac{\mu^{1-p}}{1-p} + m \right) d\mu \tag{10} $$

$$ = \frac{\mu^{2-p}}{(1-p)(2-p)} + m\mu + d. \tag{11} $$

The $f$-divergence requires $\phi(1) = 0$, and for normalization we set $\phi'(1) = 0$. Using these two constraints, the
constants of integration are determined as

$$ m = -\frac{1}{1-p}, \qquad d = \frac{1}{2-p}, \tag{12} $$

so that the dual cumulant function becomes, for $p \neq 1, 2$,

$$ \phi(\mu) = \frac{\mu^{2-p}}{(1-p)(2-p)} - \frac{\mu}{1-p} + \frac{1}{2-p}, \tag{13} $$

while the limit cases for $p = 1, 2$ are found by l'Hôpital's rule

$$ \phi(\mu) = \begin{cases}
\frac{1}{2}\mu^2 - \mu + \frac{1}{2} & p = 0 \\
\mu \log \mu - \mu + 1 & p = 1 \\
-\log \mu + \mu - 1 & p = 2.
\end{cases} $$

The same function $\phi$ is used directly by [3], [13] to derive $\beta$-divergences without justification. Some others [14]
obtain it, under the name standardized convex form, from the Bregman divergence as $\phi(\mu) = d_\phi(\mu, 1)$.
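As a numerical check of the derivation above, the following sketch evaluates the dual cumulant function (13) together with its $p = 1$ and $p = 2$ limits, and its derivative $\phi'(\mu) = \theta(\mu)$ with $m = -1/(1-p)$. The function names are illustrative assumptions, not taken from the paper.

```python
import math

def dual_cumulant(mu, p):
    """phi(mu) of (13); the p = 1 and p = 2 limits are handled explicitly."""
    if p == 1:
        return mu * math.log(mu) - mu + 1.0
    if p == 2:
        return -math.log(mu) + mu - 1.0
    return (mu ** (2 - p) / ((1 - p) * (2 - p))
            - mu / (1 - p) + 1.0 / (2 - p))

def dual_cumulant_prime(mu, p):
    """phi'(mu) = (mu^(1-p) - 1) / (1 - p), with limit log(mu) at p = 1."""
    if p == 1:
        return math.log(mu)
    return (mu ** (1 - p) - 1.0) / (1 - p)

# The two normalization constraints phi(1) = 0 and phi'(1) = 0.
constraints = [(dual_cumulant(1.0, p), dual_cumulant_prime(1.0, p))
               for p in (0.0, 1.0, 1.5, 2.0, 3.0)]
```

Evaluating `dual_cumulant` slightly away from $p = 1$ or $p = 2$ gives values close to the explicit limit branches, which is one way to see that the l'Hôpital limits are continuous extensions of (13).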

The function $\phi$ is indeed an entropy function [3] and can generate a divergence. Similar to [15], the Shannon
entropy is

$$ H[\theta] = -\int p(x \mid \theta, \varphi) \log p(x \mid \theta, \varphi)\, dx \tag{14} $$

$$ = -\varphi^{-1}\left(\theta\mu - \psi(\theta)\right) - E[\log h(x, \varphi)], \tag{15} $$

$$ H[\mu] = -\varphi^{-1}\phi(\mu) - E[\log h(x, \varphi)], \tag{16} $$

noting that $\phi(\cdot)$ is the best entropy estimate, where we maximize $\theta\mu - \psi(\theta)$ to get $H[\mu]$ [16].

In the next section, by using the convex function $\phi$, we obtain the $\beta$-divergence from the Bregman divergence and
the $\alpha$-divergence from the $f$-divergence.



B. Beta Divergence 

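The $\beta$-divergence was proposed in [17], [18] and is related to the density power divergence [19], whereas Cichocki
et al. [2], [3] show its relation to the Bregman divergence. Indeed, by use of the dual cumulant function $\phi$, the Bregman
divergence is specialized to the $\beta$-divergence

$$ d_\beta(x, \mu) = \frac{x^{2-p}}{(1-p)(2-p)} - \frac{x\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p}, \tag{17} $$

with special cases

$$ d_\beta(x, \mu) = \begin{cases}
\frac{1}{2}x^2 - x\mu + \frac{1}{2}\mu^2 & p = 0 \ \text{(EU)} \\
x \log\frac{x}{\mu} - x + \mu & p = 1 \ \text{(KL)} \\
-\log\frac{x}{\mu} + \frac{x}{\mu} - 1 & p = 2 \ \text{(IS)}.
\end{cases} \tag{18} $$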
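The closed form (17) and its limit cases (18) translate directly into code. The sketch below is illustrative only; the name `beta_divergence` and the convention of passing the Tweedie index `p` as an argument are assumptions made here, and positive `x` and `mu` are assumed throughout.

```python
import math

def beta_divergence(x, mu, p):
    """Beta divergence (17) indexed by the Tweedie power p, for x, mu > 0."""
    if p == 1:   # KL limit
        return x * math.log(x / mu) - x + mu
    if p == 2:   # Itakura-Saito limit
        return x / mu - math.log(x / mu) - 1.0
    return (x ** (2 - p) / ((1 - p) * (2 - p))
            - x * mu ** (1 - p) / (1 - p)
            + mu ** (2 - p) / (2 - p))

x, mu = 2.0, 3.0
special_cases = {
    "EU": beta_divergence(x, mu, 0),   # 0.5 * (x - mu)**2
    "KL": beta_divergence(x, mu, 1),
    "IS": beta_divergence(x, mu, 2),
}
```

For $p = 3$ (inverse Gaussian) the same function reduces algebraically to $(x - \mu)^2 / (2x\mu^2)$, which is a convenient hand check of the general branch.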



Note that we can ignore the initial conditions by setting $m = d = 0$ in (13), since for two convex functions $\phi_1, \phi_2$ such
that $\phi_1(x) = \phi_2(x) + ax + b$ for some reals $a, b$, we have $d_{\phi_1}(x, \mu) = d_{\phi_2}(x, \mu)$ [20]; the same divergence is generated if
the function $\phi$ is tilted or translated. The class of distributions induced by the cumulant function $\psi$ is independent
of the constants $m$ and $d$ [6]. After inverting (9) to solve for the parameter $\mu$,

$$ \mu = \mu(\theta) = \{(1-p)(\theta - m)\}^{1/(1-p)}, \tag{19} $$

we obtain the cumulant function for the Tweedie distributions, with integration constant $d$, as

$$ \psi(\theta) = \frac{\{(1-p)(\theta - m)\}^{(2-p)/(1-p)}}{2-p} + d. \tag{20} $$

Indeed, we can re-parametrize the canonical parameter from $\theta$ to $\theta_1 = \theta - m$, which changes the cumulant function to
$\psi_1(\theta_1) = \psi(\theta_1 + m)$, and use $\psi(\theta)$ and $\theta$ rather than $\psi_1(\theta_1)$ and $\theta_1$ [8].

Remark 1. The cumulant function parametrized by $\mu$ is obtained as

$$ \psi(\theta(\mu)) = (2-p)^{-1}\left(\mu^{2-p} - 1\right), \tag{21} $$

after plugging in $m = -1/(1-p)$ and $d = -1/(2-p)$ (solve $\psi(0) = 0$ for $d$). Likewise, the canonical parameter
is $\theta(\mu) = (\mu^{1-p} - 1)/(1-p)$, with the limit $\log \mu$ at $p = 1$.
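Remark 1 can be checked numerically through the Legendre duality $\phi(\mu) + \psi(\theta) = \theta\mu$ at $\theta = \theta(\mu)$. The following is a minimal sketch under the normalization $m = -1/(1-p)$; all function names are hypothetical and the limit cases $p = 1, 2$ are deliberately left out.

```python
def theta_of_mu(mu, p):
    """Canonical parameter (9) with m = -1/(1-p); assumes p != 1."""
    return (mu ** (1 - p) - 1.0) / (1 - p)

def mu_of_theta(theta, p):
    """Inverse map (19) with m = -1/(1-p); assumes p != 1."""
    m = -1.0 / (1 - p)
    return ((1 - p) * (theta - m)) ** (1.0 / (1 - p))

def cumulant(mu, p):
    """psi(theta(mu)) of (21); assumes p != 2."""
    return (mu ** (2 - p) - 1.0) / (2 - p)

def dual_cumulant(mu, p):
    """phi(mu) of (13); assumes p is neither 1 nor 2."""
    return (mu ** (2 - p) / ((1 - p) * (2 - p))
            - mu / (1 - p) + 1.0 / (2 - p))

p, mu = 1.5, 2.5
theta = theta_of_mu(mu, p)
# Should vanish (up to rounding) if phi and psi are conjugate duals.
legendre_gap = theta * mu - dual_cumulant(mu, p) - cumulant(mu, p)
```

The round trip `mu_of_theta(theta_of_mu(mu, p), p)` recovering `mu` is a second sanity check, this time of the inversion in (19).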

C. Alpha Divergence 

The $\alpha$-divergence is a special case of the $f$-divergence [21], obtained by using $\phi$ in (13) as the generating function

$$ d_\alpha(x, \mu) = \frac{x^{2-p}\mu^{p-1}}{(1-p)(2-p)} - \frac{x}{1-p} + \frac{\mu}{2-p}, \tag{22} $$

with the special cases

$$ d_\alpha(x, \mu) = \begin{cases}
\frac{1}{2}\frac{(x - \mu)^2}{\mu} & \text{for } p = 0 \\
x \log\frac{x}{\mu} - x + \mu & \text{for } p = 1 \\
\mu \log\frac{\mu}{x} + x - \mu & \text{for } p = 2 \\
2\left(x^{1/2} - \mu^{1/2}\right)^2 & \text{for } p = 3/2.
\end{cases} $$

Note the symmetry for $p = 1$ and $p = 2$. Here, $p = 3/2$ corresponds to the Hellinger distance, which is a metric satisfying
symmetry and the triangle inequality. It is a general rule, in fact, that $\alpha$-divergences indexed by $p_1, p_2$ enjoy a dual
relation, as illustrated by Figure 1

$$ d_{p_1}(x, \mu) = d_{p_2}(\mu, x) \iff p_1 + p_2 = 3. \tag{23} $$

The proof is based on the symmetric $f$-divergence $d_f(x, \mu) = d_{f^*}(\mu, x)$, where $f^*$ is the Csiszár dual $f^*(\mu) = \mu f(1/\mu)$
[20], [22]. Note that $d_{3/2}(x, \mu) = d_{3/2}(\mu, x)$, which proves the symmetry for the Hellinger distance.
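The dual relation (23) is easy to confirm numerically. The sketch below implements (22) with its two limit cases; the name `alpha_divergence` is an assumption made here, and the pair $p_1 = 4.176$, $p_2 = -1.176$ is taken from Figure 1.

```python
import math

def alpha_divergence(x, mu, p):
    """Alpha divergence (22) indexed by the Tweedie power p, for x, mu > 0."""
    if p == 1:   # KL
        return x * math.log(x / mu) - x + mu
    if p == 2:   # reversed KL
        return mu * math.log(mu / x) + x - mu
    return (x ** (2 - p) * mu ** (p - 1) / ((1 - p) * (2 - p))
            - x / (1 - p) + mu / (2 - p))

x, mu = 2.0, 3.0
hellinger = alpha_divergence(x, mu, 1.5)   # 2 * (sqrt(x) - sqrt(mu))**2

# Swapping the arguments corresponds to replacing p by 3 - p.
dual_gaps = {p: alpha_divergence(x, mu, p) - alpha_divergence(mu, x, 3 - p)
             for p in (0.0, 1.0, 1.5, 4.176)}
```

Every entry of `dual_gaps` should be zero up to rounding; the $p = 1.5$ entry is the self-dual Hellinger case.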

Remark 2. Interestingly, setting $p = 2 - q$ in the PVF, so that $v(\mu) = \mu^{2-q}$, results in the form of
$\alpha/\beta$-divergences more commonly used in the literature [2], [3], [22]:

$$ d_\alpha(x, \mu) = (q(q-1))^{-1}\left\{ x^q \mu^{1-q} - qx + (q-1)\mu \right\}, \tag{24} $$

$$ d_\beta(x, \mu) = (q(q-1))^{-1}\left\{ x^q - (1-q)\mu^q - q x \mu^{q-1} \right\}. \tag{25} $$

Figure 1. Illustration of the symmetric alpha divergence. Whenever $d_{p_1}(x, \mu) = d_{p_2}(\mu, x)$ (here both are 0.3), the corresponding index values
sum to 3, i.e. $p_1 + p_2 = 3$ (in the plot, $p_2 = -1.176$ and $p_1 = 4.176$); consequently, the point of intersection of the two curves is at $p_1 = p_2 = 3/2$.



D. Connection between Alpha and Beta Divergences 
The $\beta$-divergence can be written as

$$ d_\beta(x, \mu) = \mu^{1-p}\, \frac{x^{2-p}\mu^{p-1} - (2-p)x + (1-p)\mu}{(1-p)(2-p)}, \tag{26} $$

where the fraction can be identified as the $\alpha$-divergence. Thus the relation between the two divergences is

$$ d_\beta(x, \mu) = \mu^{1-p}\, d_\alpha(x, \mu), \tag{27} $$

whereas [22], [23] give other connections. Note that the two divergences are equal if either $p = 1$ or $\mu = 1$

$$ \mu^{1-p} = 1 \;\Rightarrow\; p = 1 \ \text{or} \ \mu = 1. \tag{28} $$

For the solution $p = 1$, we obtain the KL divergence, the only divergence common to $\alpha/\beta$-divergences [3]. For
$\mu = 1$, the divergences reduce trivially to the dual cumulant

$$ d_\beta(x, 1) = d_\alpha(x, 1) = \phi(x). \tag{29} $$

Here, $\mu = 1$ is related to the standard uniform distribution [22], whose entropy of zero can be regarded as the origin.
Also note the relation [24]

$$ d_\alpha(x, \mu) = \mu\,\phi\!\left(\frac{x}{\mu}\right) = \mu\, d_\beta\!\left(\frac{x}{\mu}, 1\right). \tag{30} $$
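Relations (27), (29) and (30) can be verified with a short script. This is a sketch restricted to the general case $p \notin \{1, 2\}$ to keep it brief; the function names are illustrative assumptions.

```python
def beta_divergence(x, mu, p):
    """Beta divergence (17); assumes p is neither 1 nor 2."""
    return (x ** (2 - p) / ((1 - p) * (2 - p))
            - x * mu ** (1 - p) / (1 - p)
            + mu ** (2 - p) / (2 - p))

def alpha_divergence(x, mu, p):
    """Alpha divergence (22); assumes p is neither 1 nor 2."""
    return (x ** (2 - p) * mu ** (p - 1) / ((1 - p) * (2 - p))
            - x / (1 - p) + mu / (2 - p))

x, mu = 2.0, 3.0
# Ratio of beta to alpha divergence for several indices; (27) predicts mu**(1 - p).
ratios = {p: beta_divergence(x, mu, p) / alpha_divergence(x, mu, p)
          for p in (0.0, 1.5, 3.0)}
```

The computed ratios are the entries of the second column of Table I, for example $\mu$ at $p = 0$ and $\mu^{-2}$ at $p = 3$.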
The findings are summarized in Table I, which presents the links between $\alpha$-type [14] and $\beta$-type divergences.

Table I

Divergences, distributions and entropies indexed by $p$. The second column is the ratio of $\beta$- and $\alpha$-divergences. Note also
that the symmetry condition is $p_1 + p_2 = 3$ for two $\alpha$-divergences.

p      $d_\beta / d_\alpha$    Distribution     Beta    Alpha                Entropy
0      $\mu$                   Gaussian         EU      Pearson ($\chi^2$)   -
1      $1$                     Poisson          KL      KL                   Shannon
3/2    $\mu^{-1/2}$            Comp. Poisson    -       Hellinger dist.      -
2      $\mu^{-1}$              Gamma            IS      Reversed KL          Burg
3      $\mu^{-2}$              Inv. Gaussian    -       Rev. Pearson         -

E. Scale Transformation 

Tweedie models $Tw_p(\mu, \varphi)$ are the only dispersion models that have the scale transformation property [5]

$$ \forall c \in \mathbb{R}_+: \quad c\, Tw_p(\mu, \varphi) = Tw_p(c\mu, c^{2-p}\varphi), \tag{31} $$

whose variance function $v(\mu) = \mu^p$ is scale-invariant since $v(c\mu) = c^p v(\mu)$. The corresponding result on the
divergence side is that $\alpha/\beta$-divergences are scale invariant [2], [22]

$$ d_\beta(cx, c\mu) = c^{2-p}\, d_\beta(x, \mu), \tag{32} $$

$$ d_\alpha(cx, c\mu) = c\, d_\alpha(x, \mu). \tag{33} $$
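The scaling laws (32) and (33) can be confirmed numerically. As before, this is only a sketch for $p \notin \{1, 2\}$ and a positive scale `c`, with illustrative function names.

```python
def beta_divergence(x, mu, p):
    """Beta divergence (17); assumes p is neither 1 nor 2."""
    return (x ** (2 - p) / ((1 - p) * (2 - p))
            - x * mu ** (1 - p) / (1 - p)
            + mu ** (2 - p) / (2 - p))

def alpha_divergence(x, mu, p):
    """Alpha divergence (22); assumes p is neither 1 nor 2."""
    return (x ** (2 - p) * mu ** (p - 1) / ((1 - p) * (2 - p))
            - x / (1 - p) + mu / (2 - p))

c, x, mu, p = 2.5, 2.0, 3.0, 1.5
beta_scaled = beta_divergence(c * x, c * mu, p)    # expected: c**(2 - p) * d_beta(x, mu)
alpha_scaled = alpha_divergence(c * x, c * mu, p)  # expected: c * d_alpha(x, mu)
```

Note that the $\alpha$-divergence scales linearly in `c` for every index, while the $\beta$-divergence picks up the same factor $c^{2-p}$ as the dispersion in (31).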

IV. Statistical View of Beta Divergence 
A. Density Formulation for Dispersion Models 

The density of EDMs can be reformulated in terms of the $\beta$-divergence by using the dual form of the $\beta$-divergence [4]

$$ d_\beta(x, \mu) - \phi(x) = -x\theta + \psi(\theta). \tag{34} $$

Plugging this into the EDM density, we obtain

$$ p(x; \mu, \varphi) = h(x, \varphi) \exp\{\varphi^{-1}(\phi(x) - d_\beta(x, \mu))\} \tag{35} $$

$$ = g(x, \varphi) \exp\{-\varphi^{-1} d_\beta(x, \mu)\}. \tag{36} $$

Here the base measures $h(\cdot, \cdot)$ and $g(\cdot, \cdot)$ are related as

$$ g(x, \varphi) = h(x, \varphi) \exp\{\varphi^{-1}\phi(x)\}. \tag{37} $$

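Example 1. The Gaussian distribution with dispersion $\varphi = \sigma^2$ can be expressed as an EDM [5]

$$ p(x \mid \mu, \sigma^2) = \underbrace{(2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{x^2}{2\sigma^2}\right\}}_{h(x, \varphi)} \exp\left\{\frac{1}{\sigma^2}\Big(x\mu - \underbrace{\tfrac{1}{2}\mu^2}_{\psi(\theta(\mu))}\Big)\right\}, $$

whereas the usual form is already expressed in terms of a $\beta$-divergence

$$ p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{\sigma^2}\, d_\beta(x, \mu)\right\}. $$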

Example 2. For the gamma distribution with shape parameter $a$ and inverse scale parameter $b$, the density is

$$ p(x; a, b) = \frac{x^{a-1}}{\Gamma(a)} \exp\{-bx + a \log b\}. \tag{38} $$

The mean and variance of a gamma distribution are given as $\mu = a/b$ and $\mathrm{Var}(x) = a/b^2$. Hence, the dispersion
becomes $\varphi = 1/a$, and we have $a = 1/\varphi$ and $b = 1/(\mu\varphi)$. Then the density in terms of the mean and dispersion
parameters is given as

$$ p(x; \mu, a) = \underbrace{\frac{x^{a-1} a^a}{\Gamma(a)}}_{h(x, \varphi)} \exp\Big\{a\Big(\underbrace{-\frac{1}{\mu}}_{\theta(\mu)}\, x - \log \mu\Big)\Big\}, \tag{39} $$

or equivalently, after adding and subtracting $\log x + 1$ in the exponent, we obtain

$$ p(x; \mu, a) = \underbrace{\frac{x^{-1} a^a \exp(-a)}{\Gamma(a)}}_{g(x, \varphi)} \exp\{-a\, d_\beta(x, \mu)\}. \tag{40} $$

Example 3. For the Poisson distribution with dispersion $\varphi = 1$, the density is given as [5]

$$ p(x; \mu) = \frac{1}{x!} \exp\{x \log \mu - \mu\}, \tag{41} $$

and after adding and subtracting $x \log x - x$ in the exponent we obtain the equivalent beta representation of the density

$$ p(x; \mu) = \frac{x^x e^{-x}}{x!} \exp\{-d_\beta(x, \mu)\}. \tag{42} $$
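Examples 2 and 3 can be checked by comparing the standard log-densities with their $\beta$-divergence forms. The sketch below uses only the standard library; the function names are hypothetical, and the Poisson check assumes a count $k \ge 1$ so that $k \log k$ is defined without a special case.

```python
import math

def is_divergence(x, mu):
    """Beta divergence for p = 2 (Itakura-Saito)."""
    return x / mu - math.log(x / mu) - 1.0

def kl_divergence(x, mu):
    """Beta divergence for p = 1 (KL); assumes x > 0."""
    return x * math.log(x / mu) - x + mu

def gamma_logpdf_standard(x, a, b):
    """Log of (38): shape a, inverse scale b."""
    return (a - 1) * math.log(x) - math.lgamma(a) - b * x + a * math.log(b)

def gamma_logpdf_beta_form(x, mu, a):
    """Log of (40): log g(x, 1/a) - a * d_beta(x, mu)."""
    log_g = -math.log(x) + a * math.log(a) - a - math.lgamma(a)
    return log_g - a * is_divergence(x, mu)

def poisson_logpmf_standard(k, mu):
    """Log of (41)."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def poisson_logpmf_beta_form(k, mu):
    """Log of (42): log(k^k e^-k / k!) - d_beta(k, mu); assumes k >= 1."""
    log_g = k * math.log(k) - k - math.lgamma(k + 1)
    return log_g - kl_divergence(k, mu)

a, b = 2.5, 0.5
mu_gamma = a / b   # mean of the gamma distribution
gamma_gap = gamma_logpdf_standard(3.2, a, b) - gamma_logpdf_beta_form(3.2, mu_gamma, a)
poisson_gap = poisson_logpmf_standard(4, 2.7) - poisson_logpmf_beta_form(4, 2.7)
```

Both gaps should be zero up to rounding, since the two forms differ only by terms added to and subtracted from the exponent.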

This form of the density differs from the so-called standard form of dispersion models [5] only by the factor
1/2 in the exponent, which is expressed as

$$ p(x; \mu, \varphi) = g(x, \varphi) \exp\left\{-\frac{1}{2\varphi}\, d_\nu(x, \mu)\right\}, \tag{43} $$

where $d_\nu$ is the unit deviance ('unit' implies that the deviance is scaled by the dispersion). The deviance is a statistical term
that quantifies the fit of the statistics to the model [10] and is equal to 2 times the log-likelihood ratio of the full model
to the parametrized model. The log-likelihood of the full model is the maximum achievable likelihood and is independent of
the parameter $\mu$. Hence the $\beta$-divergence is linked to the unit deviance and the log-likelihood ratio for a given dispersion
$\varphi$ as

$$ d_\beta(x, \mu) = \frac{1}{2}\, d_\nu(x, \mu) = \varphi\left\{\mathcal{L}_x(x) - \mathcal{L}_x(\mu)\right\}. \tag{44} $$

B. Parameter Optimization

Equation (44) has an immediate consequence for the optimization of the expectation parameter $\mu$ for given $p$ and $\varphi$

$$ \frac{\partial d_\beta(x, \mu)}{\partial \mu} = -\varphi\, \frac{\partial \mathcal{L}_x(\mu)}{\partial \mu} = -(x - \mu)\mu^{-p}, \tag{45} $$

where the opposite sign implies that minimizing the $\beta$-divergence is equivalent to maximizing the log-likelihood. Whereas the
optimization w.r.t. $\mu$ is trivial, the likelihood equation derived by simply plugging $\theta(\mu)$ and $\psi(\theta(\mu))$ into the log-density of
the EDM

$$ \mathcal{L}_x(\mu, \varphi, p) = \varphi^{-1}\left\{\frac{x\mu^{1-p}}{1-p} - \frac{\mu^{2-p}}{2-p}\right\} + \log h(x, \varphi, p) \tag{46} $$

presents difficulty for the optimization of $p$ and $\varphi$, because the base measure $h(x, \varphi, p)$ has no closed form except
for certain values of $p$, such as $p = 0, 1, 2, 3$. For others, such as $p \in (1, 2)$ for the compound Poisson, the function $h$
is expressed as a series [5]. There are a number of approximation techniques. One is the saddlepoint approximation,
which is interpreted as being halfway between the original density and the Gaussian approximation as the dispersion vanishes,
$\varphi \to 0$ [5]. Others are Fourier inversion of the cumulant generating function and direct series expansion, for which we refer
to [25] for the details.
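To illustrate why the optimization with respect to $\mu$ is trivial, the sketch below evaluates the gradient $-(x - \mu)\mu^{-p}$ of the $\beta$-divergence and shows that, for a shared $\mu$ and any index $p$, the summed gradient vanishes at the sample mean. The data values and function names are made up for the illustration, and only $p \notin \{1, 2\}$ is covered.

```python
def beta_divergence(x, mu, p):
    """Beta divergence (17); assumes p is neither 1 nor 2."""
    return (x ** (2 - p) / ((1 - p) * (2 - p))
            - x * mu ** (1 - p) / (1 - p)
            + mu ** (2 - p) / (2 - p))

def beta_grad_mu(x, mu, p):
    """Derivative of the beta divergence w.r.t. mu: -(x - mu) * mu^(-p)."""
    return -(x - mu) * mu ** (-p)

def total_divergence(data, mu, p):
    """Separable divergence: sum of scalar beta divergences to a common mu."""
    return sum(beta_divergence(x, mu, p) for x in data)

data = [1.0, 2.0, 4.0, 5.0]            # hypothetical observations
sample_mean = sum(data) / len(data)

# Summed gradient at the sample mean, for three different indices p.
grads_at_mean = {p: sum(beta_grad_mu(x, sample_mean, p) for x in data)
                 for p in (0.0, 1.5, 3.0)}
```

The index $p$ only reweights the residuals by $\mu^{-p}$, so with a single shared $\mu$ the minimizer does not depend on $p$; the choice of $p$ matters once $\mu$ is itself a function of model parameters.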